Data-driven methods for dengue prediction and surveillance using real-world and Big Data: A systematic review

Background Traditionally, dengue surveillance is based on case reporting to a central health agency. However, the delay between a case and its notification can limit the system's responsiveness. Machine learning methods have been developed to reduce reporting delays and to predict outbreaks, based on non-traditional and non-clinical data sources. The aim of this systematic review was to identify studies that used real-world data, Big Data and/or machine learning methods to monitor and predict dengue-related outcomes.

Methodology/Principal findings We performed a search in PubMed, Scopus, Web of Science and grey literature, covering January 1, 2000 to August 31, 2020. The review (ID: CRD42020172472) focused on data-driven studies. Reviews, randomized controlled trials and descriptive studies were not included. Among the 119 studies included, 67% were published between 2016 and 2020, and 39% used at least one novel data stream. The aim of the included studies was to predict a dengue-related outcome (55%), assess the validity of data sources for dengue surveillance (23%), or both (22%). Most studies (60%) used a machine learning approach. Studies on dengue prediction compared different prediction models, or identified significant predictors among several covariates in a model. The most significant predictors were rainfall (43%), temperature (41%), and humidity (25%). The two best-performing models were Neural Networks and Decision Trees (52%), followed by Support Vector Machines (17%). We cannot rule out a selection bias in our study because of our two main limitations: we did not include preprints and could not obtain the opinion of other international experts.

Conclusions/Significance Combining real-world data and Big Data with machine learning methods is a promising approach to improve dengue prediction and monitoring. Future studies should focus on how to better integrate all available data sources and methods to improve the response and dengue management by stakeholders.


Types of study to be included
We will include all epidemiological studies that use Big Data methods to predict: dengue outbreaks, dengue outcomes, dengue severity. Studies with no original data (reviews, editorials, guidelines, perspective pieces), randomized controlled trials, case series and case reports will not be included. Descriptive epidemiological studies without any prediction model will not be included. Studies focusing on other types of arboviruses will not be included. We will not include studies focusing exclusively on mosquitoes or in vitro studies.

Condition or domain being studied
We will focus on dengue virus (DENV), which causes one of the most important vector-borne diseases in the world. The majority of DENV infections are asymptomatic or are characterized by intense flu-like symptoms lasting up to 10 days, but they can evolve into the severe forms of dengue hemorrhagic fever/dengue shock syndrome (DHF/DSS), which can lead to death. However, mortality due to dengue can be greatly reduced by early diagnosis, which will influence appropriate clinical management.
Most dengue-endemic regions (South-East Asia, the Americas and the Pacific for the most seriously affected) rely on traditional surveillance, based on hospital syndromic monitoring and laboratory confirmation of a subset of cases reported to a central health agency. While this method is generally very accurate, it can be very slow and expensive due to the time needed to aggregate data, with substantial delays between an event and its notification.
On the other hand, numerous studies have successfully used mobile, digital and Internet-based systems to crowd-source data from the community. These new sources of data have already been used in pilot studies to improve monitoring and clinical management and to predict dengue outbreaks.

Participants/population
We will include all people with dengue, regardless of age, gender or severity of the disease.

Intervention(s), exposure(s)
We are interested in studies using Big Data methods. According to the MeSH definition of Big Data, this means all methods applied to "extremely large amounts of data which require rapid and often complex computational analyses to reveal patterns, trends, and associations, relating to various facets of human and non-human entities".
Machine learning can be defined as any computer-derived mathematical algorithm using learning to classify data. It includes:
- Supervised machine learning
- Unsupervised machine learning
- Deep learning

Comparator(s)/control
Not applicable

Context
The diagnosis of dengue in the included papers should be established using any of the standard WHO definitions and classifications (1997 or 2009).
Research in low and middle-income countries will also be included.

Main outcome(s)
- Number and type of Big Data methods and/or machine learning models used to predict or forecast a dengue outbreak and their performance (Recall, Precision, F-measure)
- Number and type of Big Data methods and/or machine learning models used to predict or forecast a severe dengue outbreak and their performance (Recall, Precision, F-measure)
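The three performance measures named above can be computed from the confusion counts of a binary outbreak/no-outbreak classifier. The following is a minimal sketch, not a method prescribed by the protocol: the `ConfusionCounts` container, the function names and the example counts are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical container for a binary outbreak classifier's counts."""
    true_positives: int
    false_positives: int
    false_negatives: int


def precision(c: ConfusionCounts) -> float:
    """Share of predicted outbreaks that were real outbreaks."""
    denom = c.true_positives + c.false_positives
    return c.true_positives / denom if denom else 0.0


def recall(c: ConfusionCounts) -> float:
    """Share of real outbreaks that the model predicted."""
    denom = c.true_positives + c.false_negatives
    return c.true_positives / denom if denom else 0.0


def f_measure(c: ConfusionCounts, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta = 1)."""
    p, r = precision(c), recall(c)
    b2 = beta * beta
    denom = b2 * p + r
    return (1 + b2) * p * r / denom if denom else 0.0


# Made-up counts, for illustration only (not taken from any included study).
example = ConfusionCounts(true_positives=40, false_positives=10, false_negatives=20)
print(f"precision={precision(example):.3f}")  # precision=0.800
print(f"recall={recall(example):.3f}")        # recall=0.667
print(f"F1={f_measure(example):.3f}")         # F1=0.727
```

Returning 0.0 when a denominator is zero is one common convention; a study may instead report such cases as undefined.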

Measures of effect
Not applicable

Additional outcome(s)
None

Measures of effect
Not applicable

Data extraction (selection and coding)
Two authors from the review team will independently extract outcome data from each study using a Microsoft Excel collection form. In case of disagreement, a third reviewer will help to reach a consensus.
The data collection and extraction will proceed as follows.
3. Data sources used for the models
4. Outcomes predicted
5. Machine learning models employed and role in the study
6. Model performance and evaluation
We will use Zotero for managing the search and writing the review.
The extracted data will be stored in Microsoft Excel individually.
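As an illustration of the double-extraction step, the sketch below compares two reviewers' records for the same study and lists the fields that would need a third reviewer. It is only a sketch under stated assumptions: the protocol uses an Excel form, and the `ExtractionRecord` class, its field names, the `disagreements` helper and the example values are hypothetical, loosely mirroring the extraction items listed above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ExtractionRecord:
    """Hypothetical record mirroring the extraction items listed above."""
    study_id: str                  # assumed identifier, not part of the protocol
    data_sources: frozenset        # data sources used for the models
    outcomes_predicted: frozenset  # outcomes predicted
    ml_models: frozenset           # machine learning models employed
    performance: str               # model performance and evaluation, as reported


def disagreements(a: ExtractionRecord, b: ExtractionRecord) -> list:
    """Return the names of fields on which two reviewers differ."""
    if a.study_id != b.study_id:
        raise ValueError("records refer to different studies")
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Made-up example: the two reviewers differ only on the models they recorded.
reviewer_a = ExtractionRecord(
    study_id="study-001",
    data_sources=frozenset({"rainfall", "search queries"}),
    outcomes_predicted=frozenset({"outbreak"}),
    ml_models=frozenset({"decision tree"}),
    performance="F-measure reported",
)
reviewer_b = ExtractionRecord(
    study_id="study-001",
    data_sources=frozenset({"rainfall", "search queries"}),
    outcomes_predicted=frozenset({"outbreak"}),
    ml_models=frozenset({"decision tree", "neural network"}),
    performance="F-measure reported",
)
print(disagreements(reviewer_a, reviewer_b))  # ['ml_models']
```

Storing multi-valued items as sets means that the order in which a reviewer lists data sources or models does not produce a spurious disagreement.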

Risk of bias (quality) assessment
Each reviewer will independently assess risk of bias using the Prediction model Risk Of Bias ASsessment Tool (PROBAST).

Strategy for data synthesis
We will summarise the results using descriptive statistics and a narrative synthesis. For the narrative synthesis, we will identify common patterns and compile the results into sub-categories:
- Most frequent methods used (and machine learning category)
- Performance and evaluation of each model
- Most frequent data sources used
- Contribution of non-clinical data versus clinical data
- Influence of study participants on the model performance (scientific outcomes?)
No meta-analysis will be conducted for this review.
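The descriptive part of the synthesis (for example, the most frequent methods or data sources) amounts to counting how many studies report each category. A minimal sketch follows, assuming each extracted study is a dictionary with a list-valued field; the `share_by_category` helper, the `"methods"` key and the four toy studies are assumptions made for illustration, not data from the review.

```python
from collections import Counter


def share_by_category(studies: list, key: str) -> dict:
    """Return {category: percentage of studies reporting it}, most frequent first."""
    total = len(studies)
    if total == 0:
        return {}
    counts = Counter()
    for study in studies:
        # Use a set so that a study counts at most once per category.
        for category in set(study[key]):
            counts[category] += 1
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}


# Toy data, for illustration only.
toy_studies = [
    {"methods": ["neural network", "SVM"]},
    {"methods": ["decision tree"]},
    {"methods": ["neural network"]},
    {"methods": ["SVM", "neural network"]},
]
shares = share_by_category(toy_studies, "methods")
print(shares["neural network"], shares["SVM"], shares["decision tree"])  # 75.0 50.0 25.0
```

Because one study can use several methods, the percentages are shares of studies and need not sum to 100.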
The record owner confirms that the information they have supplied for this submission is accurate and complete and they understand that deliberate provision of inaccurate information or omission of data may be construed as scientific misconduct.

Type and method of review
The record owner confirms that they will update the status of the review when it is completed and will add publication details in due course.

Versions
28 April 2020
PROSPERO
This information has been provided by the named contact for this review. CRD has accepted this information in good faith and registered the review in PROSPERO. The registrant confirms that the information supplied for this submission is accurate and complete. CRD bears no responsibility or liability for the content of this registration record, any associated files or external websites.